Projections Using a Hyperparameter-Tuned XGBoost Model
All data is from FanGraphs. I have no affiliation with FanGraphs, but please consider contributing to their website if you found this project informative.
This project is designed to showcase how a percentile-based worth system values fantasy baseball players through an Innings Pitched (IP) weighted projection.
The categories used for prediction valuation are year-end rankings for the following metrics:
- Home Runs (HR)
- Runs (R)
- Runs Batted In (RBI)
- Batting Average (AVG)
- Stolen Bases (SB)
First, we need to load the packages that R needs to run the analysis.
library(sqldf) #SQL in R
library(skimr) #Summaries and useful for removing low % data
library(ggplot2) #Plotting Functions
library(plyr) #slightly deprecated data cleaning
library(dplyr) #slightly updated data cleaning
library(tidyverse) #tidyverse data cleaning universe
library(caret) #wrapper for creating, tuning and validating models
library(xgboost) #package for creating regression tree model
library(vtreat) # useful package for treating data before modeling
library(Matrix) #sparse matrix support
library(Boruta) #feature selection
library(mgcv) #generalized additive models
library(moments) #for measuring skewness
library(data.table) #alternative to dplyr we use to create lags
library(pdp) #partial dependence graphs
library(vip) #variable importance
library(grid) #put multiple plots on one grid
library(gridExtra) #additional grid functionality
library(janitor) #one function used to clean transposed data set
library(ggpubr) #for qq plot
library(tableHTML) #HTML tables
library(kableExtra) #formatted kable tables
The # comments generally explain what additional functionality each library adds to R.
All data is downloaded from FanGraphs, from this location. The data is also available on my GitHub here. There are player-level and team-level data sets.
#data read-in
Batter_data <- read_csv("FanGraphs Leaderboard_Hitting50PA.csv")
#Team datasets
FDG_Team = read_csv("FanGraphs Leaderboard_Team.csv")
#Create a prefix for all team stats that starts with T_
FDG_Team2 <- FDG_Team %>%
rename_with( ~ paste0("T_", .x))
str gives information about an object, while skim provides a customizable summary.
#Output not shown for space
#str(FDG_Team2)
skim(FDG_Team2) %>%
tibble::as_tibble()
skim lets us see how the data was imported into R.
Documentation can be found here
#Full Dataset dimensions
skimr::skim(Batter_data) %>%
tibble::as_tibble() %>%
select(skim_type,skim_variable,complete_rate) %>%
filter(complete_rate >0.30) #288 Variables
#skim_type - character or numeric
#skim_variable - name of variable
#complete_rate - % of data that is not missing
#filter - only keep variables that have at least 30% of data populated
Additionally, let's look at how the variables vary by year to see if there are any discrepancies.
#It looks like in one year fewer games were played, and there is a clear drop-off in home runs
Batter_data_dist =
Batter_data %>%
group_by(Season) %>%
summarize (Games_played = max(G),
Avg_HR= mean(HR)
)
Batter_data_dist
ggplot(Batter_data_dist, aes(Season, Avg_HR)) +
geom_col()+
ggtitle("Average Home Runs by Year")+
theme(plot.title = element_text(hjust = 0.5,size = 22,color ="steelblue"))
What are some issues with the data?
- Many variables, such as K%, are being read in as characters
- There is spotty data coverage in some of the variables (some have less than 30% coverage)
- The 2020 data only includes 60 games' worth of data
- Team data needs to be appended to the batter data by team name
There are several ways to do this; we will identify the mis-typed variables
we want to change. parse_number can be used to pull numbers from these
variables. Additional ways to tackle this can be found here.
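Before applying it to the full dataset, here is a quick illustration of how parse_number behaves (a minimal sketch on made-up strings, not the FanGraphs data):

```r
library(readr)

# parse_number strips non-numeric characters and returns a double,
# so dividing by 100 converts "23.4%" style strings to proportions
parse_number(c("23.4%", "10.0%", "5%")) / 100  # 0.234 0.100 0.050

# Caution: it will also extract digits from strings that are not
# really numbers, so inspect columns before converting wholesale
parse_number("Route 66")  # 66
```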
#Select Column names that are characters but not Team or Name, These should be percentages
Batter_data_chars_to_convert <- Batter_data %>%
select_if(is.character)%>% select(-Team,-Name) %>%
mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Note : There are additional ways to do this, this is just one solution
#We can exclude the variables we converted and reintroduce them
Batter_data_num <- Batter_data %>% select(-colnames(Batter_data_chars_to_convert))
Batter_data2 = cbind(Batter_data_num,Batter_data_chars_to_convert) %>%
select (colnames(Batter_data)) %>% #preserve original order
dplyr::rename(flyball_perc = `FB%...46`,fastball_perc = `FB%...73`) #rename two ambiguous columns
skim(Batter_data2) %>%
as_tibble() %>%
group_by(skim_type) %>%
count()
#Logical variables are R's best guess; in our case they are all NA's and will be removed
The same can be done for the team data that is loaded.
#Select Column names that are characters but not Team or Name, These should be percentages
FDG_Team2_chars_to_convert <- FDG_Team2 %>%
select_if(is.character)%>% select(-T_Team) %>%
mutate_all (function(x) as.numeric(readr::parse_number(x))/100)
#Keep in mind, parse number may make actual characters into numerical variables so carefully check your data before using
#We can exclude the variables we converted and reintroduce them
FDG_Team2_num <- FDG_Team2 %>% select(-colnames(FDG_Team2_chars_to_convert))
FDG_Team3 = cbind(FDG_Team2_num,FDG_Team2_chars_to_convert) %>%
select (colnames(FDG_Team2)) %>% #preserve original order
dplyr::rename(T_flyball_perc = `T_FB%...45`,T_fastball_perc = `T_FB%...72`)
skim(FDG_Team3) %>%
as_tibble() %>%
group_by(skim_type) %>%
count()
I chose 30% as the required data coverage, but this can be adjusted up or down. This will also get rid of columns that are all NA.
# Keep variables with enough values (Need 30% data coverage rate here)
Player_cols_to_keep =
skim(Batter_data2) %>%
dplyr::select(skim_type, skim_variable, complete_rate) %>%
filter (complete_rate > 0.30)
#Transpose Rows to get column names as skim melts the data
Player_cols_to_keep_transpose = t(Player_cols_to_keep)
#extract the colnames we would like to keep
Player_cols_to_keep = colnames(janitor::row_to_names(Player_cols_to_keep_transpose,row_number = 2))
#Only keep the columns designated to have over 30% of their data populated or greater
Batter_data3 = Batter_data2 %>%
select(one_of(Player_cols_to_keep))
Repeat the process for the team variables.
Team_cols_to_keep =
skim(FDG_Team3) %>%
dplyr::select(skim_type, skim_variable, complete_rate) %>%
filter (complete_rate > 0.30)
#Transpose Rows to get column names as skim melts the data
Team_cols_to_keep_transpose = t(Team_cols_to_keep)
#extract the colnames we would like to keep
Team_cols_to_keep = colnames(janitor::row_to_names(Team_cols_to_keep_transpose,row_number = 2))
#Only keep the columns designated to have over 30% of their data populated or greater
FDG_Team4 = FDG_Team3 %>%
select(one_of(Team_cols_to_keep))
Some variables will need to be normalized by Plate Appearances (PA) if they aren't a percentage already. The remaining variables are percentages or indices, so they will not need to be transformed.
Batter_data4 = Batter_data3 %>%
mutate( #create new variables based on existing variables
H_PA = H/PA,
x1B_PA = `1B`/PA, #note: R can't have variables start with a number
x2b_PA = `2B`/PA,
x3b_PA = `3B`/PA,
HR_PA = HR/PA,
R_PA = R/PA,
RBI_PA = RBI/PA,
BB_PA = BB/PA,
IBB_PA = IBB/PA,
SO_PA=SO/PA,
HBP_PA=HBP/PA,
SF_PA=SF/PA,
SH_PA=SH/PA,
GDP_PA= GDP/PA,#ground into double play
SB_PA=SB/PA,
CS_PA=CS/PA,
GB_PA = GB/PA, #Groundballs
FB_PA = FB/PA, #FlyBalls
LD_PA = LD/PA, #LineDrives
IFFB_PA = IFFB/PA, #Infield Fly balls
Pitches_PA= Pitches/PA,
Balls_PA= Balls/PA,
Strikes_PA= Strikes/PA,
IFH_PA= IFH/PA,
BU_PA= BU/PA,
BUH_PA= BUH/PA,
PH_PA= PH/PA,
Barrels_PA= Barrels/PA,
HardHits_PA= HardHit/PA
) %>% select(-(H:CS),-(GB:BUH),-PH,-Barrels,-HardHit,-Events) #Drop the old variables
#skim(Batter_data4) %>% as_tibble()
Repeat the process for the team variables.
FDG_Team5 = FDG_Team4 %>%
mutate( #create new variables based on existing variables
T_H_T_PA = T_H/T_PA,
T_x1B_T_PA = T_1B/T_PA, #note: R can't have variables start with a number
T_x2b_T_PA = T_2B/T_PA,
T_x3b_T_PA = T_3B/T_PA,
T_HR_T_PA = T_HR/T_PA,
T_R_T_PA = T_R/T_PA,
T_RBI_T_PA = T_RBI/T_PA,
T_BB_T_PA = T_BB/T_PA,
T_IBB_T_PA = T_IBB/T_PA,
T_SO_T_PA=T_SO/T_PA,
T_HBP_T_PA=T_HBP/T_PA,
T_SF_T_PA=T_SF/T_PA,
T_SH_T_PA=T_SH/T_PA,
T_GDP_T_PA= T_GDP/T_PA,#ground into double play
T_SB_T_PA=T_SB/T_PA,
T_CS_T_PA=T_CS/T_PA,
T_GB_T_PA = T_GB/T_PA, #Groundballs
T_FB_T_PA = T_FB/T_PA, #FlyBalls
T_LD_T_PA = T_LD/T_PA, #LineDrives
T_IFFB_T_PA = T_IFFB/T_PA, #Infield Fly balls
T_Pitches_T_PA= T_Pitches/T_PA,
T_Balls_T_PA= T_Balls/T_PA,
T_Strikes_T_PA= T_Strikes/T_PA,
T_IFH_T_PA= T_IFH/T_PA,
T_BU_T_PA= T_BU/T_PA,
T_BUH_T_PA= T_BUH/T_PA,
T_PH_T_PA= T_PH/T_PA,
T_Barrels_T_PA= T_Barrels/T_PA,
T_HardHits_T_PA= T_HardHit/T_PA
) %>% select(-(T_H:T_CS),-(T_GB:T_BUH),-T_PH,-T_Barrels,-T_HardHit,-T_Events) #Drop the old variables
#skim(FDG_Team5) %>% as_tibble()
There are several ways to lag a dataset BY GROUP:
* The dplyr way is here.
* The data.table way (the method used below) is here.
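For reference, a minimal sketch of the dplyr version on a made-up toy table (the analysis below uses data.table; these column values are illustrative):

```r
library(dplyr)

toy <- tibble(
  playerid = c(1, 1, 1, 2, 2),
  Season   = c(2019, 2020, 2021, 2020, 2021),
  HR       = c(30, 12, 35, 20, 25)
)

toy_lagged <- toy %>%
  arrange(playerid, Season) %>%    # order within each player by year
  group_by(playerid) %>%
  mutate(lag_HR = lag(HR, 1)) %>%  # previous season's value; NA for a player's first year
  ungroup()

toy_lagged$lag_HR  # NA 30 12 NA 20
```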
#Note we will only be lagging the player level data, as the previous year's team performance shouldn't impact current performance
#Order the dataset by lag columns
Batter_data5 = arrange(Batter_data4, playerid,Season) #playerid is the Fangraph id assigned to each player
# Convert dataframe to data.table format
DT_batter = data.table(Batter_data5)
#designate columns to lag - which is all of them
cols1 = colnames(Batter_data5)
anscols = paste("lag", cols1, sep="_")
DT_batter[, (anscols) := data.table::shift(.SD, 1, NA, "lag"),by ='playerid', .SDcols=cols1] #Create 1-period lags within each player
Batter_data6 = as.data.frame(DT_batter) %>% select(-lag_playerid, -lag_Team, -lag_Season, -lag_Age,-lag_Name)
ncol(Batter_data5) #259 columns - no lags
[1] 259
ncol(Batter_data6) #513 columns - lagged data ~ (259 * 2) - 5
[1] 513
We can use either the merge function or the SQL
functionality provided by the sqldf package to join the
lagged player level data to the Team level data
df_batting_init = sqldf(
"
select a.*, b.*
from Batter_data6 a
left join FDG_Team5 b
on a.Team = b.T_Team and a.Season = b.T_Season
"
) %>% select(-T_Team,-T_Season,-T_Age,-T_G,-T_AB) #Unnecessary team variables
nrow(df_batting_init) - nrow(Batter_data6) #check if any rows were duplicated
[1] 0
We can use percentile-based ranking to get rankings for players from the 2021 season.
Each player scores from 0% to 100% on each stat used for scoring. All data is already normalized by plate appearances, but must now be ranked within each year.
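The ranking idiom used below, order(order(rank(x)))/n(), converts a stat into a within-year percentile; a small base-R sketch with made-up numbers:

```r
x <- c(10, 30, 20, 40)

# order(order(v)) recovers each element's rank (ties broken by position),
# so dividing by the group size yields a percentile in (0, 1]
pct <- order(order(rank(x, ties.method = "average"), decreasing = FALSE)) / length(x)
pct  # 0.25 0.75 0.50 1.00
```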
#Categories I include are:
#Runs (R), Home Runs (HR), Runs Batted In (RBI), Stolen Bases (SB), Batting Average (AVG)
df_batting_init2 = df_batting_init %>%
# arrange(player_id,year) %>%
group_by(Season) %>%
mutate(
Runs_share = order(order(rank(R_PA,ties.method = 'average'),decreasing = FALSE))/n(),
HR_share = order(order(rank(HR_PA,ties.method = 'average'),decreasing = FALSE))/n(),
RBI_share = order(order(rank(RBI_PA,ties.method = 'average'),decreasing = FALSE))/n(),
SB_share = order(order(rank(SB_PA,ties.method = 'average'),decreasing = FALSE))/n(),
AVG_share = order(order(rank(AVG,ties.method = 'average'),decreasing = FALSE))/n(),
OPS_share = 0,
Worth = Runs_share+HR_share+RBI_share+SB_share+AVG_share+OPS_share
) %>%
ungroup()
Below is a chart of the distribution of the initial percentiles.
As the chart below shows, the data is roughly normal.
skewness(df_batting_init2$Worth)
[1] -0.25
ggplot2::qplot(df_batting_init2$Worth, main="Total Dataset") + geom_histogram(colour="black", fill="grey") + theme_bw()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
min(df_batting_init2$Worth)
[1] 0.029
max(df_batting_init2$Worth)
[1] 4.8
ggpubr::ggqqplot(df_batting_init2$Worth)
shapiro.test(df_batting_init2$Worth)

	Shapiro-Wilk normality test

data:  df_batting_init2$Worth
W = 1, p-value < 0.0000000000000002
Total Rankings for the players (Using 5x5 Scoring) can be found here
options(digits=2)
df_batting_init2021 =
df_batting_init2 %>%
group_by(Name) %>%
filter(Season == 2021) %>%
arrange(desc(Worth)) %>%
select(Name,Runs_share,HR_share,RBI_share, SB_share,OPS_share,AVG_share,Worth)
df_batting_init2021 %>%
filter (Worth>3.9) %>%
kbl() %>%
kable_material(c("striped", "hover","condensed","responsive"),full_width = F,fixed_thead = T)
| Name | Runs_share | HR_share | RBI_share | SB_share | OPS_share | AVG_share | Worth |
|---|---|---|---|---|---|---|---|
| Fernando Tatis Jr. | 0.99 | 1.00 | 0.98 | 0.96 | 0 | 0.89 | 4.8 |
| Ronald Acuna Jr. | 1.00 | 0.98 | 0.82 | 0.96 | 0 | 0.90 | 4.7 |
| Byron Buxton | 1.00 | 0.99 | 0.70 | 0.92 | 0 | 0.97 | 4.6 |
| Tyler O'Neill | 0.96 | 0.97 | 0.86 | 0.88 | 0 | 0.91 | 4.6 |
| Jose Ramirez | 0.98 | 0.93 | 0.94 | 0.94 | 0 | 0.78 | 4.6 |
| Teoscar Hernandez | 0.92 | 0.90 | 1.00 | 0.80 | 0 | 0.94 | 4.6 |
| Kyle Tucker | 0.87 | 0.89 | 0.94 | 0.86 | 0 | 0.94 | 4.5 |
| Bryce Harper | 0.96 | 0.95 | 0.80 | 0.82 | 0 | 0.97 | 4.5 |
| Bo Bichette | 0.99 | 0.76 | 0.85 | 0.92 | 0 | 0.95 | 4.5 |
| Shohei Ohtani | 0.95 | 0.99 | 0.90 | 0.94 | 0 | 0.69 | 4.5 |
| Javier Baez | 0.87 | 0.93 | 0.92 | 0.91 | 0 | 0.77 | 4.4 |
| Frank Schwindel | 0.97 | 0.90 | 0.95 | 0.55 | 0 | 0.99 | 4.4 |
| Vladimir Guerrero Jr. | 0.99 | 0.98 | 0.92 | 0.49 | 0 | 0.98 | 4.4 |
| Trea Turner | 0.96 | 0.78 | 0.63 | 0.97 | 0 | 1.00 | 4.3 |
| Brandon Crawford | 0.85 | 0.78 | 0.94 | 0.80 | 0 | 0.95 | 4.3 |
| Nick Castellanos | 0.95 | 0.94 | 0.97 | 0.47 | 0 | 0.97 | 4.3 |
| Marcus Semien | 0.94 | 0.96 | 0.80 | 0.81 | 0 | 0.77 | 4.3 |
| Juan Soto | 0.96 | 0.80 | 0.83 | 0.70 | 0 | 0.99 | 4.3 |
| Luis Robert | 0.84 | 0.79 | 0.83 | 0.81 | 0 | 1.00 | 4.3 |
| Brandon Belt | 0.97 | 0.99 | 0.89 | 0.56 | 0 | 0.85 | 4.3 |
| Paul Goldschmidt | 0.90 | 0.83 | 0.84 | 0.76 | 0 | 0.94 | 4.3 |
| Rafael Devers | 0.91 | 0.93 | 0.97 | 0.55 | 0 | 0.88 | 4.2 |
| Manny Machado | 0.85 | 0.79 | 0.95 | 0.78 | 0 | 0.88 | 4.2 |
| Jorge Polanco | 0.90 | 0.88 | 0.88 | 0.76 | 0 | 0.80 | 4.2 |
| A.J. Pollock | 0.64 | 0.87 | 0.94 | 0.82 | 0 | 0.95 | 4.2 |
| George Springer | 0.98 | 0.97 | 0.84 | 0.66 | 0 | 0.76 | 4.2 |
| Mike Trout | 0.93 | 0.91 | 0.67 | 0.69 | 0 | 1.00 | 4.2 |
| Aaron Judge | 0.83 | 0.96 | 0.88 | 0.60 | 0 | 0.92 | 4.2 |
| Ozzie Albies | 0.90 | 0.78 | 0.88 | 0.89 | 0 | 0.71 | 4.2 |
| Matt Olson | 0.89 | 0.94 | 0.94 | 0.50 | 0 | 0.83 | 4.1 |
| Adam Engel | 0.89 | 0.88 | 0.72 | 0.97 | 0 | 0.64 | 4.1 |
| Thairo Estrada | 0.85 | 0.89 | 0.96 | 0.55 | 0 | 0.85 | 4.1 |
| Brandon Lowe | 0.93 | 0.97 | 0.93 | 0.66 | 0 | 0.59 | 4.1 |
| Avisail Garcia | 0.73 | 0.92 | 0.96 | 0.72 | 0 | 0.74 | 4.1 |
| Freddie Freeman | 0.98 | 0.80 | 0.64 | 0.66 | 0 | 0.95 | 4.0 |
| Yordan Alvarez | 0.91 | 0.91 | 0.98 | 0.34 | 0 | 0.87 | 4.0 |
| Tim Anderson | 0.97 | 0.57 | 0.54 | 0.90 | 0 | 0.98 | 4.0 |
| Jesse Winker | 0.93 | 0.87 | 0.84 | 0.36 | 0 | 0.96 | 4.0 |
| Kyle Schwarber | 0.95 | 0.98 | 0.87 | 0.36 | 0 | 0.78 | 3.9 |
| Chas McCormick | 0.88 | 0.79 | 0.90 | 0.68 | 0 | 0.69 | 3.9 |
| Jake Meyers | 0.77 | 0.68 | 0.98 | 0.77 | 0 | 0.72 | 3.9 |
Not all variables can be used for predictive modeling.
df_batting_init3 = df_batting_init2
Lag the share variables to use for predictive modeling. The variables that we created for the Worth metric must also be removed. This will create the final dataset.
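A minimal sketch of that final step on a made-up toy table (the real pipeline would apply the same pattern to df_batting_init3; the column names mirror the share variables above):

```r
library(dplyr)
library(data.table)

share_cols <- c("Runs_share", "HR_share", "RBI_share",
                "SB_share", "AVG_share", "OPS_share")

toy <- data.table(
  playerid   = c(1, 1, 2, 2),
  Season     = c(2020, 2021, 2020, 2021),
  Runs_share = c(0.40, 0.90, 0.60, 0.20),
  HR_share   = c(0.50, 0.80, 0.30, 0.10),
  RBI_share  = c(0.45, 0.85, 0.55, 0.15),
  SB_share   = c(0.20, 0.60, 0.70, 0.40),
  AVG_share  = c(0.35, 0.75, 0.65, 0.25),
  OPS_share  = c(0, 0, 0, 0)
)
toy[, Worth := Runs_share + HR_share + RBI_share + SB_share + AVG_share + OPS_share]

# Lag each share one season within each player, then drop the
# current-season share components, which would leak the target (Worth)
setorder(toy, playerid, Season)
toy[, paste0("lag_", share_cols) := shift(.SD, 1), by = playerid, .SDcols = share_cols]
model_frame <- as.data.frame(toy) %>% select(-all_of(share_cols))
```

Worth stays in model_frame as the target; only the lagged shares remain as candidate predictors.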